As we are getting started, you will need to install and load the following R packages:
Our goal is to predict some outcome variable, Y, from a set of independent variables,
\[\begin{equation} \{ X_1, X_2, \ldots, X_p \} \end{equation}\]
Let’s think back to linear regression. We have a basic model
\[\begin{equation} Y = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p \end{equation}\]
Our goal is to have the fewest independent variables while maintaining good predictive capacity.
Linear regression has several advantages:
But these advantages do not come without a price:
With machine learning, the idea is that a model can be tuned to a specific dataset. The term arose in the 1950s for a set of techniques to extract as much information as possible from a dataset without human intervention. Nowadays we search for good models, where good means:
This is not to say that regression is obsolete, because it is still the starter model.
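Since regression remains the starter model, it helps to have a baseline fit in hand before growing any trees. Here is a minimal sketch using the built-in mtcars data as a stand-in (we will meet our real dataset shortly); the variable choices here are purely illustrative:

```r
# Baseline: ordinary linear regression as the starter model,
# illustrated on the built-in mtcars data.
fit_lm <- lm(mpg ~ wt + hp, data = mtcars)

summary(fit_lm)$r.squared   # predictive capacity on the training data
length(coef(fit_lm)) - 1    # number of independent variables used

# Stepwise selection drops variables that do not pay their way,
# one way to pursue "fewest variables, good prediction"
fit_step <- step(lm(mpg ~ ., data = mtcars), trace = 0)
```

This is the reference point the tree-based methods below will be competing against.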
As mentioned above there are two types of machine learning:
Today we are talking about supervised learning. In this case we want to identify the function that describes the relationship between inputs and outputs, kind of like a black box. For instance,
So suppose we have successive ozone readings, but we also have some other information and we want to choose from a set of independent variables.
One way that we can choose the independent variables is to create a regression tree such that the variables split the space into a series of partitions where the outcome variables are more and more alike.
Well that certainly sounds like a big mess, doesn’t it?
This technique is called recursive partitioning, and it enables exploring the data space, making it easier to visualize decision rules for a continuous outcome (a regression tree), like we are examining here, or for a categorical outcome (a decision tree), like we will talk about in a bit.
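To make the core step of recursive partitioning concrete, here is a sketch in base R: for a single predictor, scan the candidate split points and keep the one that most reduces the within-partition sum of squares of the outcome. The function name best_split and the simulated data are our own inventions for illustration; rpart performs this search over every predictor at every node.

```r
# One step of recursive partitioning: find the split point for one
# predictor x that minimizes the total within-partition sum of squares
# of a continuous outcome y.
best_split <- function(x, y) {
  cand <- sort(unique(x))
  cand <- (head(cand, -1) + tail(cand, -1)) / 2   # midpoints between values
  sse  <- sapply(cand, function(s) {
    left  <- y[x <  s]
    right <- y[x >= s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  cand[which.min(sse)]
}

set.seed(1)
x <- runif(200, 0, 10)
y <- ifelse(x < 4, 2, 8) + rnorm(200, sd = 0.5)  # true break at x = 4
best_split(x, y)                                 # should land near 4
```

Apply this same search within each resulting partition, stop when the subsets are homogeneous enough, and you have grown a regression tree.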
First, today’s vocabulary lesson:
| Term | Definition |
|---|---|
| Root node | Represents the entire population; gets divided into two or more homogeneous subsets |
| Splitting | Dividing a node into two or more sub-nodes |
| Node | Some (sub)set of observations |
| Decision Node | When a sub-node splits into further subnodes |
| Leaf/Terminal Node | Nodes that do not split further |
| Pruning | Removal of sub-nodes from a decision node; the opposite of splitting |
| Branch/Sub-Tree | A sub-section of an entire tree |
| Parent/Child Node | A node that is divided into sub-nodes is the parent; the sub-nodes are its children |
One R tool that makes this a lot easier to implement is the package called rpart, and we will load it now.
library(rpart)
library(partykit)
## Loading required package: grid
## Loading required package: libcoin
## Loading required package: mvtnorm
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0 ✔ purrr 0.3.0
## ✔ tibble 2.0.1 ✔ dplyr 0.7.8
## ✔ tidyr 0.8.2 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.3.0
## ── Conflicts ────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(RColorBrewer)
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
library(NHANES)
library(mosaic)
## Loading required package: lattice
## Loading required package: ggformula
## Loading required package: ggstance
##
## Attaching package: 'ggstance'
## The following objects are masked from 'package:ggplot2':
##
## geom_errorbarh, GeomErrorbarh
##
## New to ggformula? Try the tutorials:
## learnr::run_tutorial("introduction", package = "ggformula")
## learnr::run_tutorial("refining", package = "ggformula")
## Loading required package: mosaicData
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
##
## expand
##
## The 'mosaic' package masks several functions from core packages in order to add
## additional features. The original behavior of these functions should not be affected by this.
##
## Note: If you use the Matrix package, be sure to load it BEFORE loading mosaic.
##
## Attaching package: 'mosaic'
## The following object is masked from 'package:Matrix':
##
## mean
## The following objects are masked from 'package:dplyr':
##
## count, do, tally
## The following object is masked from 'package:purrr':
##
## cross
## The following object is masked from 'package:ggplot2':
##
## stat
## The following objects are masked from 'package:stats':
##
## binom.test, cor, cor.test, cov, fivenum, IQR, median,
## prop.test, quantile, sd, t.test, var
## The following objects are masked from 'package:base':
##
## max, mean, min, prod, range, sample, sum
We will proceed in a stepwise manner.
To grow the tree we will use the command rpart(formula, data =, method =, control =), where:
- formula is in the format Y ~ X1 + X2 + … + Xp
- data = specifies the data frame
- method = “anova” for a regression tree
- method = “class” for a decision tree
- control = a series of optional parameters that control the process of tree growth
The output is an object called fit.
We can use the following functions to examine the results.
| Function | Description |
|---|---|
| printcp(fit) | display the cp table |
| plotcp(fit) | plot cross-validation results |
| rsq.rpart(fit) | plot approximate R-squared and relative error for different numbers of splits |
| print(fit) | print results |
| summary(fit) | detailed results including surrogate splits |
| plot(fit) | plot decision tree |
| text(fit) | label the decision tree plot |
We will want to prune the tree in order to avoid overfitting the data. We will select the tree size that minimizes the cross-validated error (see the xerror column printed by printcp(fit)). We will then prune to the desired size using
prune(fit, cp=)
Specifically, you use printcp() to find the complexity parameter associated with the minimum cross-validated error, then pass that value to prune().
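Putting those two steps together, a minimal sketch of the workflow looks like this (here using the kyphosis data frame that ships with rpart; with your own model, substitute your fitted object for fitk):

```r
library(rpart)

# Grow a tree (kyphosis ships with rpart, so this is self-contained)
fitk <- rpart(Kyphosis ~ Age + Number + Start, method = "class", data = kyphosis)

# Pick the cp value from the row of the cp table with the smallest
# cross-validated error (the xerror column)
best_cp <- fitk$cptable[which.min(fitk$cptable[, "xerror"]), "CP"]

# Prune back to that size
pruned <- prune(fitk, cp = best_cp)
```

We will use exactly this idiom on real data below.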
We are going to explore the ozone dataset that comes as part of the faraway package. We want to predict ozone (the variable O3) from the other variables, but we are going to do it by partitioning the space. First let’s load up the dataset and see what’s what.
library(faraway) #Install to get access to the dataset
##
## Attaching package: 'faraway'
## The following objects are masked from 'package:mosaic':
##
## ilogit, logit
## The following object is masked from 'package:lattice':
##
## melanoma
## The following object is masked from 'package:rpart':
##
## solder
The variable names are as follows:
| Variable Name | Description |
|---|---|
| O3 | Daily Ozone Level |
| vh | Pressure height |
| wind | Wind speed at LAX |
| humidity | Humidity |
| temp | Temperature |
| ibh | Inversion base height |
| dpg | Pressure gradient |
| ibt | Inversion base temp |
| vis | Visibility at LAX |
| doy | Day of year |
Let’s take a look:
summary(ozone) # What does the dataset look like?
## O3 vh wind humidity
## Min. : 1.00 Min. :5320 Min. : 0.000 Min. :19.00
## 1st Qu.: 5.00 1st Qu.:5690 1st Qu.: 3.000 1st Qu.:47.00
## Median :10.00 Median :5760 Median : 5.000 Median :64.00
## Mean :11.78 Mean :5750 Mean : 4.848 Mean :58.13
## 3rd Qu.:17.00 3rd Qu.:5830 3rd Qu.: 6.000 3rd Qu.:73.00
## Max. :38.00 Max. :5950 Max. :11.000 Max. :93.00
## temp ibh dpg ibt
## Min. :25.00 Min. : 111.0 Min. :-69.00 Min. :-25.0
## 1st Qu.:51.00 1st Qu.: 877.5 1st Qu.: -9.00 1st Qu.:107.0
## Median :62.00 Median :2112.5 Median : 24.00 Median :167.5
## Mean :61.75 Mean :2572.9 Mean : 17.37 Mean :161.2
## 3rd Qu.:72.00 3rd Qu.:5000.0 3rd Qu.: 44.75 3rd Qu.:214.0
## Max. :93.00 Max. :5000.0 Max. :107.00 Max. :332.0
## vis doy
## Min. : 0.0 Min. : 33.0
## 1st Qu.: 70.0 1st Qu.:120.2
## Median :120.0 Median :205.5
## Mean :124.5 Mean :209.4
## 3rd Qu.:150.0 3rd Qu.:301.8
## Max. :350.0 Max. :390.0
glimpse(ozone) # Let's take a little glimpse
## Observations: 330
## Variables: 10
## $ O3 <dbl> 3, 5, 5, 6, 4, 4, 6, 7, 4, 6, 5, 4, 4, 7, 5, 9, 4, 3, 4…
## $ vh <dbl> 5710, 5700, 5760, 5720, 5790, 5790, 5700, 5700, 5770, 5…
## $ wind <dbl> 4, 3, 3, 4, 6, 3, 3, 3, 8, 3, 6, 6, 3, 2, 5, 4, 5, 4, 3…
## $ humidity <dbl> 28, 37, 51, 69, 19, 25, 73, 59, 27, 44, 33, 19, 19, 19,…
## $ temp <dbl> 40, 45, 54, 35, 45, 55, 41, 44, 54, 51, 51, 54, 58, 61,…
## $ ibh <dbl> 2693, 590, 1450, 1568, 2631, 554, 2083, 2654, 5000, 111…
## $ dpg <dbl> -25, -24, 25, 15, -33, -28, 23, -2, -19, 9, -44, -44, -…
## $ ibt <dbl> 87, 128, 139, 121, 123, 182, 114, 91, 92, 173, 181, 135…
## $ vis <dbl> 250, 100, 60, 60, 100, 250, 120, 120, 120, 150, 40, 200…
## $ doy <dbl> 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46,…
Before we go any further, let’s explore our data just a little bit.
# Plot panels for each covariate
ozonem <- melt(ozone, id.vars="O3")
ggplot(ozonem, aes(x=O3, y=value)) +
geom_point(alpha=0.4)+
scale_color_brewer(palette="Set2")+
facet_wrap(~variable, scales="free_y", ncol=3)
Based on these plots, we would expect temp, ibt, and possibly vh to be important predictors.
The basic idea of a regression tree is that we split the dataset into increasingly finer partitions with the goal of reducing variability of our outcome variable. In this case our outcome variable is ozone, O3. What happens if we split the dataset into two partitions at the median of the variable temperature? What is the variability of O3 in those two subsets, and how does that compare to its variability in the overall dataset?
var(ozone$O3) # Overall variability of ozone
## [1] 64.18057
# Let's split at median of temperature
temp.lt.med <- filter(ozone, temp < 62)
temp.ge.med <- filter(ozone, temp >= 62)
var(temp.lt.med$O3)
## [1] 14.08899
var(temp.ge.med$O3)
## [1] 56.54459
So you see, we have created two subsets in which the variability of our outcome variable is reduced. Regression trees use that basic principle to keep growing, reducing the variability at every step.
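It is worth checking that the reduction holds once each subset's variance is weighted by its size, since that size-weighted sum of squares is what the splitting algorithm actually scores. A self-contained sketch of the same check, using base R's own airquality data (its Ozone and Temp columns) so that no extra package is needed:

```r
# Compare overall variance of ozone to the size-weighted average of the
# within-subset variances after a median split on temperature.
aq  <- na.omit(airquality[, c("Ozone", "Temp")])
med <- median(aq$Temp)

lo <- aq$Ozone[aq$Temp <  med]
hi <- aq$Ozone[aq$Temp >= med]

overall  <- var(aq$Ozone)
weighted <- (length(lo) * var(lo) + length(hi) * var(hi)) / nrow(aq)

weighted < overall   # the split reduces the weighted variability
```

The same computation on our faraway ozone data gives the same qualitative answer: splitting on temperature buys a genuine reduction in variability.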
Suppose we wanted to partition on the basis of the variable temperature alone; we just call it up in the package rpart:
fittemp <- rpart(O3 ~ temp, data = ozone)
Let’s look at the results:
printcp(fittemp) # Display the results
##
## Regression tree:
## rpart(formula = O3 ~ temp, data = ozone)
##
## Variables actually used in tree construction:
## [1] temp
##
## Root node error: 21115/330 = 63.986
##
## n= 330
##
## CP nsplit rel error xerror xstd
## 1 0.545699 0 1.00000 1.00655 0.077039
## 2 0.062419 1 0.45430 0.48443 0.042001
## 3 0.044122 2 0.39188 0.43025 0.038168
## 4 0.010000 3 0.34776 0.37354 0.034047
plotcp(fittemp) # Visualize cross-validation results
summary(fittemp) # Detailed summary of fit
## Call:
## rpart(formula = O3 ~ temp, data = ozone)
## n= 330
##
## CP nsplit rel error xerror xstd
## 1 0.54569932 0 1.0000000 1.0065523 0.07703864
## 2 0.06241925 1 0.4543007 0.4844346 0.04200132
## 3 0.04412234 2 0.3918814 0.4302453 0.03816783
## 4 0.01000000 3 0.3477591 0.3735352 0.03404676
##
## Variable importance
## temp
## 100
##
## Node number 1: 330 observations, complexity param=0.5456993
## mean=11.77576, MSE=63.98608
## left son=2 (214 obs) right son=3 (116 obs)
## Primary splits:
## temp < 67.5 to the left, improve=0.5456993, (0 missing)
##
## Node number 2: 214 observations, complexity param=0.04412234
## mean=7.425234, MSE=19.22572
## left son=4 (138 obs) right son=5 (76 obs)
## Primary splits:
## temp < 58.5 to the left, improve=0.2264444, (0 missing)
##
## Node number 3: 116 observations, complexity param=0.06241925
## mean=19.80172, MSE=47.22793
## left son=6 (72 obs) right son=7 (44 obs)
## Primary splits:
## temp < 79.5 to the left, improve=0.2405809, (0 missing)
##
## Node number 4: 138 observations
## mean=5.876812, MSE=12.87613
##
## Node number 5: 76 observations
## mean=10.23684, MSE=18.49654
##
## Node number 6: 72 observations
## mean=17.16667, MSE=36.47222
##
## Node number 7: 44 observations
## mean=24.11364, MSE=34.87345
# plot tree
plot(fittemp, uniform = TRUE, compress = FALSE)
text(fittemp, use.n = TRUE, all = TRUE, cex = 0.5)
Now we can throw all of the variables in, and see how it partitions:
fitall <- rpart(O3 ~ ., data = ozone)
So what does this partition look like?
# Now let's look at fitall
printcp(fitall) # Display the results
##
## Regression tree:
## rpart(formula = O3 ~ ., data = ozone)
##
## Variables actually used in tree construction:
## [1] doy dpg humidity ibh ibt temp vis
##
## Root node error: 21115/330 = 63.986
##
## n= 330
##
## CP nsplit rel error xerror xstd
## 1 0.545699 0 1.00000 1.00618 0.076679
## 2 0.073659 1 0.45430 0.47808 0.041647
## 3 0.053542 2 0.38064 0.44237 0.038930
## 4 0.026756 3 0.32710 0.39832 0.036725
## 5 0.023276 4 0.30034 0.39401 0.035869
## 6 0.023102 5 0.27707 0.38620 0.035204
## 7 0.015325 6 0.25397 0.38651 0.037520
## 8 0.010914 7 0.23864 0.36378 0.032832
## 9 0.010000 8 0.22773 0.37096 0.034457
plotcp(fitall) # Visualize cross-validation results
summary(fitall) # Detailed summary of fit
## Call:
## rpart(formula = O3 ~ ., data = ozone)
## n= 330
##
## CP nsplit rel error xerror xstd
## 1 0.54569932 0 1.0000000 1.0061807 0.07667919
## 2 0.07365906 1 0.4543007 0.4780831 0.04164690
## 3 0.05354155 2 0.3806416 0.4423662 0.03893035
## 4 0.02675573 3 0.3271001 0.3983198 0.03672511
## 5 0.02327600 4 0.3003443 0.3940069 0.03586936
## 6 0.02310206 5 0.2770684 0.3861990 0.03520433
## 7 0.01532492 6 0.2539663 0.3865122 0.03752028
## 8 0.01091366 7 0.2386414 0.3637758 0.03283189
## 9 0.01000000 8 0.2277277 0.3709644 0.03445737
##
## Variable importance
## temp ibt vh ibh humidity vis dpg doy
## 31 24 20 9 5 5 4 2
##
## Node number 1: 330 observations, complexity param=0.5456993
## mean=11.77576, MSE=63.98608
## left son=2 (214 obs) right son=3 (116 obs)
## Primary splits:
## temp < 67.5 to the left, improve=0.5456993, (0 missing)
## ibt < 190.5 to the left, improve=0.4551252, (0 missing)
## ibh < 3472.5 to the right, improve=0.3336499, (0 missing)
## vh < 5825 to the left, improve=0.3237221, (0 missing)
## vis < 145 to the right, improve=0.1932240, (0 missing)
## Surrogate splits:
## ibt < 191.5 to the left, agree=0.870, adj=0.629, (0 split)
## vh < 5815 to the left, agree=0.852, adj=0.578, (0 split)
## humidity < 70.5 to the left, agree=0.694, adj=0.129, (0 split)
## ibh < 2396 to the right, agree=0.691, adj=0.121, (0 split)
## vis < 95 to the right, agree=0.679, adj=0.086, (0 split)
##
## Node number 2: 214 observations, complexity param=0.05354155
## mean=7.425234, MSE=19.22572
## left son=4 (108 obs) right son=5 (106 obs)
## Primary splits:
## ibh < 3573.5 to the right, improve=0.2747856, (0 missing)
## ibt < 117.5 to the left, improve=0.2536357, (0 missing)
## temp < 58.5 to the left, improve=0.2264444, (0 missing)
## doy < 355.5 to the right, improve=0.1112585, (0 missing)
## dpg < -20.5 to the left, improve=0.1109613, (0 missing)
## Surrogate splits:
## ibt < 115.5 to the left, agree=0.869, adj=0.736, (0 split)
## vh < 5675 to the left, agree=0.696, adj=0.387, (0 split)
## temp < 56.5 to the left, agree=0.650, adj=0.292, (0 split)
## vis < 65 to the right, agree=0.631, adj=0.255, (0 split)
## dpg < 29.5 to the right, agree=0.617, adj=0.226, (0 split)
##
## Node number 3: 116 observations, complexity param=0.07365906
## mean=19.80172, MSE=47.22793
## left son=6 (55 obs) right son=7 (61 obs)
## Primary splits:
## ibt < 226.5 to the left, improve=0.2839022, (0 missing)
## temp < 79.5 to the left, improve=0.2405809, (0 missing)
## ibh < 2112.5 to the right, improve=0.1997126, (0 missing)
## vis < 145 to the right, improve=0.1626629, (0 missing)
## vh < 5865 to the left, improve=0.1371483, (0 missing)
## Surrogate splits:
## ibh < 1536.5 to the right, agree=0.836, adj=0.655, (0 split)
## vh < 5835 to the left, agree=0.819, adj=0.618, (0 split)
## temp < 79.5 to the left, agree=0.819, adj=0.618, (0 split)
## dpg < 45.5 to the right, agree=0.716, adj=0.400, (0 split)
## vis < 65 to the right, agree=0.621, adj=0.200, (0 split)
##
## Node number 4: 108 observations
## mean=5.148148, MSE=6.38546
##
## Node number 5: 106 observations, complexity param=0.02675573
## mean=9.745283, MSE=21.64267
## left son=10 (35 obs) right son=11 (71 obs)
## Primary splits:
## dpg < -9.5 to the left, improve=0.2462632, (0 missing)
## humidity < 37.5 to the left, improve=0.1923283, (0 missing)
## doy < 357 to the right, improve=0.1909689, (0 missing)
## temp < 58.5 to the left, improve=0.1563676, (0 missing)
## vis < 145 to the right, improve=0.1188114, (0 missing)
## Surrogate splits:
## humidity < 43 to the left, agree=0.868, adj=0.600, (0 split)
## ibt < 195.5 to the right, agree=0.783, adj=0.343, (0 split)
## doy < 351.5 to the right, agree=0.774, adj=0.314, (0 split)
## wind < 2.5 to the left, agree=0.736, adj=0.200, (0 split)
## ibh < 626 to the left, agree=0.726, adj=0.171, (0 split)
##
## Node number 6: 55 observations, complexity param=0.01532492
## mean=15.94545, MSE=23.21521
## left son=12 (10 obs) right son=13 (45 obs)
## Primary splits:
## humidity < 59.5 to the left, improve=0.2534326, (0 missing)
## ibh < 2785 to the right, improve=0.2252837, (0 missing)
## vis < 145 to the right, improve=0.1969641, (0 missing)
## ibt < 170 to the left, improve=0.1719023, (0 missing)
## vh < 5845 to the right, improve=0.1141336, (0 missing)
## Surrogate splits:
## ibh < 4043 to the right, agree=0.891, adj=0.4, (0 split)
## dpg < -2 to the left, agree=0.891, adj=0.4, (0 split)
## ibt < 168.5 to the left, agree=0.891, adj=0.4, (0 split)
## vis < 175 to the right, agree=0.891, adj=0.4, (0 split)
## vh < 5880 to the right, agree=0.855, adj=0.2, (0 split)
##
## Node number 7: 61 observations, complexity param=0.02310206
## mean=23.27869, MSE=43.38135
## left son=14 (8 obs) right son=15 (53 obs)
## Primary splits:
## doy < 306.5 to the right, improve=0.1843390, (0 missing)
## dpg < -10.5 to the left, improve=0.1709821, (0 missing)
## temp < 72.5 to the left, improve=0.1461252, (0 missing)
## ibh < 418 to the left, improve=0.1426827, (0 missing)
## humidity < 54.5 to the left, improve=0.1287632, (0 missing)
## Surrogate splits:
## temp < 71.5 to the left, agree=0.934, adj=0.500, (0 split)
## dpg < -15.5 to the left, agree=0.918, adj=0.375, (0 split)
## vis < 12 to the left, agree=0.885, adj=0.125, (0 split)
##
## Node number 10: 35 observations
## mean=6.457143, MSE=10.36245
##
## Node number 11: 71 observations, complexity param=0.023276
## mean=11.3662, MSE=19.24618
## left son=22 (40 obs) right son=23 (31 obs)
## Primary splits:
## ibt < 159 to the left, improve=0.3596705, (0 missing)
## vh < 5725 to the left, improve=0.2262516, (0 missing)
## doy < 76.5 to the left, improve=0.1854606, (0 missing)
## ibh < 1154.5 to the right, improve=0.1645331, (0 missing)
## temp < 51.5 to the left, improve=0.1233578, (0 missing)
## Surrogate splits:
## vh < 5725 to the left, agree=0.831, adj=0.613, (0 split)
## temp < 61.5 to the left, agree=0.746, adj=0.419, (0 split)
## ibh < 1154.5 to the right, agree=0.732, adj=0.387, (0 split)
## dpg < 9.5 to the right, agree=0.690, adj=0.290, (0 split)
## vis < 55 to the right, agree=0.662, adj=0.226, (0 split)
##
## Node number 12: 10 observations
## mean=10.8, MSE=16.76
##
## Node number 13: 45 observations
## mean=17.08889, MSE=17.45877
##
## Node number 14: 8 observations
## mean=16, MSE=49.75
##
## Node number 15: 53 observations, complexity param=0.01091366
## mean=24.37736, MSE=33.21609
## left son=30 (36 obs) right son=31 (17 obs)
## Primary splits:
## vis < 55 to the right, improve=0.13090170, (0 missing)
## humidity < 79.5 to the left, improve=0.08026820, (0 missing)
## temp < 90.5 to the left, improve=0.06998096, (0 missing)
## doy < 196.5 to the right, improve=0.06220180, (0 missing)
## ibt < 294 to the left, improve=0.05597715, (0 missing)
## Surrogate splits:
## humidity < 79.5 to the left, agree=0.755, adj=0.235, (0 split)
## doy < 266.5 to the left, agree=0.736, adj=0.176, (0 split)
## vh < 5915 to the left, agree=0.717, adj=0.118, (0 split)
## dpg < -0.5 to the right, agree=0.698, adj=0.059, (0 split)
##
## Node number 22: 40 observations
## mean=9.05, MSE=7.1975
##
## Node number 23: 31 observations
## mean=14.35484, MSE=18.93861
##
## Node number 30: 36 observations
## mean=22.94444, MSE=31.94136
##
## Node number 31: 17 observations
## mean=27.41176, MSE=22.35986
And for the plot:
plot(fitall, uniform = TRUE, compress = FALSE, main = "Regression Tree for Ozone Dataset")
text(fitall, use.n = TRUE, all = TRUE, cex = 0.5)
Let’s think about pruning the tree now.
# Prune the tree
pfit <- prune(fitall, cp = fitall$cptable[which.min(fitall$cptable[, "xerror"]), "CP"])
# Plot the pruned tree
plot(pfit, uniform = TRUE, compress = FALSE, main = "Pruned Regression Tree for Ozone")
text(pfit, use.n = TRUE, all = TRUE, cex = 0.5)
The package party provides nonparametric (conditional inference) regression trees, and it also creates better graphics. Let’s give it a try.
library(party)
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
##
## Attaching package: 'strucchange'
## The following object is masked from 'package:stringr':
##
## boundary
##
## Attaching package: 'party'
## The following objects are masked from 'package:partykit':
##
## cforest, ctree, ctree_control, edge_simple, mob, mob_control,
## node_barplot, node_bivplot, node_boxplot, node_inner,
## node_surv, node_terminal, varimp
fitallp <- ctree(O3 ~ ., data = ozone)
plot(fitallp, main = "Conditional Inference Tree for Ozone")
In this package, tree growth is based on statistical stopping rules, thus pruning should not be necessary.
Another type of classifier is a decision tree, which handles a categorical outcome. This is the same kind of problem addressed by logistic regression: you will recall that a couple of weeks ago we talked about the generalized linear model, and logistic regression is one form of such a model.
# Logistic regression on Kyphosis
summary(kyphosis)
## Kyphosis Age Number Start
## absent :64 Min. : 1.00 Min. : 2.000 Min. : 1.00
## present:17 1st Qu.: 26.00 1st Qu.: 3.000 1st Qu.: 9.00
## Median : 87.00 Median : 4.000 Median :13.00
## Mean : 83.65 Mean : 4.049 Mean :11.49
## 3rd Qu.:130.00 3rd Qu.: 5.000 3rd Qu.:16.00
## Max. :206.00 Max. :10.000 Max. :18.00
mykyph <- kyphosis %>% mutate(kyph = (Kyphosis == "present"))
summary(mykyph)
## Kyphosis Age Number Start
## absent :64 Min. : 1.00 Min. : 2.000 Min. : 1.00
## present:17 1st Qu.: 26.00 1st Qu.: 3.000 1st Qu.: 9.00
## Median : 87.00 Median : 4.000 Median :13.00
## Mean : 83.65 Mean : 4.049 Mean :11.49
## 3rd Qu.:130.00 3rd Qu.: 5.000 3rd Qu.:16.00
## Max. :206.00 Max. :10.000 Max. :18.00
## kyph
## Mode :logical
## FALSE:64
## TRUE :17
##
##
##
glm(kyph ~ Age + Number + Start, data = mykyph)
##
## Call: glm(formula = kyph ~ Age + Number + Start, data = mykyph)
##
## Coefficients:
## (Intercept) Age Number Start
## 0.261220 0.001066 0.052556 -0.030739
##
## Degrees of Freedom: 80 Total (i.e. Null); 77 Residual
## Null Deviance: 13.43
## Residual Deviance: 9.974 AIC: 70.21
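One caution about the call above: glm() defaults to the Gaussian family, so as written it actually fit an ordinary linear model to the 0/1 indicator. A genuine logistic regression needs family = binomial, as in this sketch (kyphosis ships with rpart, so the block is self-contained):

```r
library(rpart)   # provides the kyphosis data frame

mykyph <- transform(kyphosis, kyph = (Kyphosis == "present"))

# Logistic regression: note family = binomial, unlike the default above
fit_logit <- glm(kyph ~ Age + Number + Start, data = mykyph,
                 family = binomial)
coef(fit_logit)
```

The coefficients are now on the log-odds scale, so, for example, a negative coefficient on Start means that operating lower on the spine is associated with lower odds of kyphosis.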
Let’s turn for a moment to the kyphosis data frame. Here the outcome of interest is kyphosis, a type of spinal deformation, after surgery, and the independent variables are age in months (Age), the number of vertebrae involved (Number), and the number of the first (topmost) vertebra operated on (Start).
# Grow tree
summary(kyphosis)
## Kyphosis Age Number Start
## absent :64 Min. : 1.00 Min. : 2.000 Min. : 1.00
## present:17 1st Qu.: 26.00 1st Qu.: 3.000 1st Qu.: 9.00
## Median : 87.00 Median : 4.000 Median :13.00
## Mean : 83.65 Mean : 4.049 Mean :11.49
## 3rd Qu.:130.00 3rd Qu.: 5.000 3rd Qu.:16.00
## Max. :206.00 Max. :10.000 Max. :18.00
fitk <- rpart(Kyphosis ~ Age + Number + Start, method = "class", data = kyphosis)
class(fitk)
## [1] "rpart"
# Display the results
printcp(fitk)
##
## Classification tree:
## rpart(formula = Kyphosis ~ Age + Number + Start, data = kyphosis,
## method = "class")
##
## Variables actually used in tree construction:
## [1] Age Start
##
## Root node error: 17/81 = 0.20988
##
## n= 81
##
## CP nsplit rel error xerror xstd
## 1 0.176471 0 1.00000 1 0.21559
## 2 0.019608 1 0.82353 1 0.21559
## 3 0.010000 4 0.76471 1 0.21559
#Visualize the cross-validation results
plotcp(fitk)
# Get a detailed summary of the splits
summary(fitk)
## Call:
## rpart(formula = Kyphosis ~ Age + Number + Start, data = kyphosis,
## method = "class")
## n= 81
##
## CP nsplit rel error xerror xstd
## 1 0.17647059 0 1.0000000 1 0.2155872
## 2 0.01960784 1 0.8235294 1 0.2155872
## 3 0.01000000 4 0.7647059 1 0.2155872
##
## Variable importance
## Start Age Number
## 64 24 12
##
## Node number 1: 81 observations, complexity param=0.1764706
## predicted class=absent expected loss=0.2098765 P(node) =1
## class counts: 64 17
## probabilities: 0.790 0.210
## left son=2 (62 obs) right son=3 (19 obs)
## Primary splits:
## Start < 8.5 to the right, improve=6.762330, (0 missing)
## Number < 5.5 to the left, improve=2.866795, (0 missing)
## Age < 39.5 to the left, improve=2.250212, (0 missing)
## Surrogate splits:
## Number < 6.5 to the left, agree=0.802, adj=0.158, (0 split)
##
## Node number 2: 62 observations, complexity param=0.01960784
## predicted class=absent expected loss=0.09677419 P(node) =0.7654321
## class counts: 56 6
## probabilities: 0.903 0.097
## left son=4 (29 obs) right son=5 (33 obs)
## Primary splits:
## Start < 14.5 to the right, improve=1.0205280, (0 missing)
## Age < 55 to the left, improve=0.6848635, (0 missing)
## Number < 4.5 to the left, improve=0.2975332, (0 missing)
## Surrogate splits:
## Number < 3.5 to the left, agree=0.645, adj=0.241, (0 split)
## Age < 16 to the left, agree=0.597, adj=0.138, (0 split)
##
## Node number 3: 19 observations
## predicted class=present expected loss=0.4210526 P(node) =0.2345679
## class counts: 8 11
## probabilities: 0.421 0.579
##
## Node number 4: 29 observations
## predicted class=absent expected loss=0 P(node) =0.3580247
## class counts: 29 0
## probabilities: 1.000 0.000
##
## Node number 5: 33 observations, complexity param=0.01960784
## predicted class=absent expected loss=0.1818182 P(node) =0.4074074
## class counts: 27 6
## probabilities: 0.818 0.182
## left son=10 (12 obs) right son=11 (21 obs)
## Primary splits:
## Age < 55 to the left, improve=1.2467530, (0 missing)
## Start < 12.5 to the right, improve=0.2887701, (0 missing)
## Number < 3.5 to the right, improve=0.1753247, (0 missing)
## Surrogate splits:
## Start < 9.5 to the left, agree=0.758, adj=0.333, (0 split)
## Number < 5.5 to the right, agree=0.697, adj=0.167, (0 split)
##
## Node number 10: 12 observations
## predicted class=absent expected loss=0 P(node) =0.1481481
## class counts: 12 0
## probabilities: 1.000 0.000
##
## Node number 11: 21 observations, complexity param=0.01960784
## predicted class=absent expected loss=0.2857143 P(node) =0.2592593
## class counts: 15 6
## probabilities: 0.714 0.286
## left son=22 (14 obs) right son=23 (7 obs)
## Primary splits:
## Age < 111 to the right, improve=1.71428600, (0 missing)
## Start < 12.5 to the right, improve=0.79365080, (0 missing)
## Number < 3.5 to the right, improve=0.07142857, (0 missing)
##
## Node number 22: 14 observations
## predicted class=absent expected loss=0.1428571 P(node) =0.1728395
## class counts: 12 2
## probabilities: 0.857 0.143
##
## Node number 23: 7 observations
## predicted class=present expected loss=0.4285714 P(node) =0.08641975
## class counts: 3 4
## probabilities: 0.429 0.571
# Plot the tree
plot(fitk, uniform = TRUE, main = "Classification Tree for Kyphosis")
text(fitk, use.n = TRUE, all = TRUE, cex = 0.8)
Now let’s try pruning the tree:
# Prune the tree
prune_fitk <- prune(fitk, cp = fitk$cptable[which.min(fitk$cptable[, "xerror"]), "CP"])
class(prune_fitk)
## [1] "rpart"
###############################
# Remember when I said in class that if you keep knitting, you can get beyond this error. Well, just forget it. I've commented it out so you can just move on.
# Plot the pruned tree
#plot(prune_fitk, uniform = TRUE, main = "Pruned Classification Tree for Kyphosis")
#text(prune_fitk, use.n = TRUE, all = TRUE, cex = 0.8)
################
Let’s see what happens when we use the party package instead:
library(party)
fitallpk <- ctree(Kyphosis ~ ., data = kyphosis)
class(fitallpk)
## [1] "BinaryTree"
## attr(,"package")
## [1] "party"
plot(fitallpk, main = "Conditional Inference Tree for Kyphosis")
When you are looking at discrete data, you can no longer use the variability as a criterion for splitting. One alternative is the classification error rate: if we assign each observation in a given region the most commonly occurring class of observed values in that region, the classification error rate is the fraction of observations that don’t match that predicted value. We can summarize this across all regions using the Gini index, defined as follows:
\[\begin{equation} G=\sum_{k=1}^{K} \hat{p}_{mk}(1-\hat{p}_{mk}) \end{equation}\]
which is a measure of total variance across the K classes that we create; here \(\hat{p}_{mk}\) is the proportion of observations in region m that belong to class k. The partitioning programs use the Gini index to evaluate candidate splits, preferring those with the smallest Gini values.
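The Gini index is simple enough to compute by hand. A sketch (gini_index is our own helper name for illustration, not an rpart function):

```r
# Gini index for one node: sum over classes of p_k * (1 - p_k).
# 0 means the node is pure; larger values mean more mixing.
gini_index <- function(classes) {
  p <- table(classes) / length(classes)
  sum(p * (1 - p))
}

gini_index(c("yes", "yes", "yes", "yes"))  # pure node -> 0
gini_index(c("yes", "no", "yes", "no"))    # 50/50 mix -> 0.5
```

A split is attractive when the size-weighted Gini of the two child nodes is much smaller than the Gini of the parent.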
Let’s do another dataset. Consider the NHANES dataset. We want to predict who has diabetes in the NHANES dataset.
# Call the dataset, select some variables, discard anybody with unknown values, and take a look
people <- NHANES %>% select(Age, Gender, Diabetes, BMI, HHIncome, PhysActive) %>% na.omit()
glimpse(people)
## Observations: 7,555
## Variables: 6
## $ Age <int> 34, 34, 34, 49, 45, 45, 45, 66, 58, 54, 58, 50, 33, 6…
## $ Gender <fct> male, male, male, female, female, female, female, mal…
## $ Diabetes <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, N…
## $ BMI <dbl> 32.22, 32.22, 32.22, 30.57, 27.24, 27.24, 27.24, 23.6…
## $ HHIncome <fct> 25000-34999, 25000-34999, 25000-34999, 35000-44999, 7…
## $ PhysActive <fct> No, No, No, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Ye…
# What is the marginal distribution of Diabetes?
tally(~ Diabetes, data = people, format = "percent")
## Diabetes
## No Yes
## 90.946393 9.053607
Now let’s try the partitioning:
# Recursive partitioning of diabetes on age, bmi, gender, and physical activity
whoIsDiabetic <- rpart(Diabetes ~ Age + BMI + Gender + PhysActive, data = people, control = rpart.control(cp = 0.005, minbucket = 30))
whoIsDiabetic
## n= 7555
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 7555 684 No (0.90946393 0.09053607)
## 2) Age< 52.5 5092 188 No (0.96307934 0.03692066) *
## 3) Age>=52.5 2463 496 No (0.79861957 0.20138043)
## 6) BMI< 39.985 2301 416 No (0.81920904 0.18079096) *
## 7) BMI>=39.985 162 80 No (0.50617284 0.49382716)
## 14) Age>=67.5 50 18 No (0.64000000 0.36000000) *
## 15) Age< 67.5 112 50 Yes (0.44642857 0.55357143)
## 30) Age< 60.5 71 30 No (0.57746479 0.42253521) *
## 31) Age>=60.5 41 9 Yes (0.21951220 0.78048780) *
# Plot the tree
plot(as.party(whoIsDiabetic))
How would you interpret this? You might say that if you are 52 or younger, you have a very small chance of having diabetes. If you are older than 52, your chance is greater, and if your BMI is 40 or above, your risk increases even more. Let’s look at how this partitions the space.
# Graph as partition
ggplot(data = people, aes(x = Age, y = BMI)) +
geom_count(aes(color = Diabetes), alpha = 0.5) +
geom_vline(xintercept = 52.5) +
geom_segment(x=52.5, xend = 100, y = 39.985, yend = 39.985) +
geom_segment(x = 67.5, xend = 67.5, y = 39.985, yend = Inf) +
geom_segment(x = 60.5, xend = 60.5, y = 39.985, yend = Inf) +
annotate("rect", xmin = 60.5, xmax = 67.5, ymin = 39.985, ymax = Inf, fill = "blue", alpha = 0.1)
This is a nice way to visualize a complex model. The downside is that recursive partitioning divides up the space so that everybody in the same rectangle receives the same prediction.
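You can verify the one-prediction-per-rectangle point directly: predict() on an rpart fit returns one value per terminal node, so the number of distinct predictions can be no larger than the number of leaves (and two leaves can happen to share a value). A sketch with the kyphosis tree, which behaves the same way as the NHANES-based tree:

```r
library(rpart)

fitk  <- rpart(Kyphosis ~ Age + Number + Start, method = "class",
               data = kyphosis)

# Predicted probability of kyphosis for every observation
preds <- predict(fitk, type = "prob")[, "present"]

# Everyone in a leaf gets the identical prediction, so the count of
# distinct predictions is at most the count of terminal nodes
length(unique(preds))
sum(fitk$frame$var == "<leaf>")
```

That coarseness is exactly what random forests, below, try to smooth out.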
Random forests improve predictive accuracy by generating a large number of bootstrapped trees, then averaging across all trees. This is implemented in (what else?) the randomForestSRC package. We will also use the ggRandomForests package for plotting nice graphs.
library(RColorBrewer)
library(plot3D)
library(parallel)
library(randomForestSRC)
##
## randomForestSRC 2.8.0
##
## Type rfsrc.news() to see new features, changes, and bug fixes.
##
##
## Attaching package: 'randomForestSRC'
## The following object is masked from 'package:purrr':
##
## partial
library(ggRandomForests)
##
## Attaching package: 'ggRandomForests'
## The following object is masked from 'package:randomForestSRC':
##
## partial.rfsrc
set.seed(131)
# Random Forest for the ozone dataset
fitallrf <- rfsrc(O3 ~ ., data = ozone, ntree = 100, tree.err=TRUE)
# view the results
fitallrf
## Sample size: 330
## Number of trees: 100
## Forest terminal node size: 5
## Average no. of terminal nodes: 66.33
## No. of variables tried at each split: 3
## Total no. of variables: 9
## Resampling used to grow trees: swr
## Resample size used to grow trees: 330
## Analysis: RF-R
## Family: regr
## Splitting rule: mse *random*
## Number of random split points: 10
## % variance explained: 74.22
## Error rate: 16.55
We see that there are 100 trees in our forest, built from 330 observations and 9 independent variables. The algorithm randomly selected 3 candidate variables at each split and grew terminal nodes containing no fewer than 5 observations.
# Plot the OOB errors against the growth of the forest
gg_e <- gg_error(fitallrf)
plot(gg_e)
## Warning: Removed 99 rows containing missing values (geom_path).
## geom_path: Each group consists of only one observation. Do you need to
## adjust the group aesthetic?
We also see that after about 70 trees, there is no substantial improvement in the error rate.
Random forests work on the basis of bootstrap aggregation (“bagging”). Each bootstrap sample selects, on average, a little less than two-thirds of the observations for the training set. The remaining third, the Out-of-Bag (OOB) sample, serves as the test set for each tree in the forest. An OOB prediction error estimate is calculated for each observation by aggregating the predicted response over the set of trees that were not trained with that particular observation. This error rate has been shown to be nearly identical to n-fold cross-validation. This feature of random forests allows us to obtain both model fit and validation in one fell swoop.
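Where does "a little less than two-thirds" come from? A quick self-contained check (simulation only, nothing from the lesson's data): a bootstrap sample of size n drawn with replacement contains on average about 1 - 1/e, roughly 63.2%, of the distinct observations, and the rest form the OOB sample.

```r
# Fraction of distinct observations that land in one bootstrap sample.
set.seed(1)
n <- 100000
boot <- sample(n, n, replace = TRUE)   # bootstrap: n draws with replacement
frac_in_bag <- length(unique(boot)) / n
frac_in_bag   # roughly 0.632
1 - exp(-1)   # theoretical limit: 0.6321206
```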
The gg-rfsrc function extracts the OOB prediction estimates from the random forest.
# Plot the predicted ozone values
plot(gg_rfsrc(fitallrf), alpha = 0.5)
With random forests, the goal is not to be parsimonious, but to use all available variables to construct the response predictor. Also, random forests are non-parametric in that they don't require specification of the functional form relating covariates to the response. For these reasons, there is no explicit p-value or significance testing for variable selection. Instead, a random forest ascertains which variables contribute to the prediction through the split-rule optimization, choosing variables that optimize the separation of observations. There are two main approaches for variable selection with random forests: Variable Importance and Minimal Depth.
Variable importance (VIMP) involves “noising up” each variable in turn. The VIMP for a variable is then the difference between the prediction error when that variable is “noised up” by permutation and the prediction error when it is not: in practice, the OOB prediction error after permutation minus the OOB prediction error before.
The interpretation is thus: a variable with a large positive VIMP improves the prediction, while a VIMP near zero means the variable adds little. So if a variable has a negative VIMP, this means that random noise is more informative than that variable.
Given this interpretation, we will ignore variables with negative VIMPs and also those with VIMPs near zero.
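The "noising up" idea can be sketched in a few lines of base R. This toy example uses simulated data and ordinary least squares in place of a forest, so treat it purely as an illustration of permutation importance, not of what rfsrc computes internally:

```r
# Toy permutation importance: y depends strongly on x1, barely on x2.
set.seed(42)
n  <- 500
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 2 * x1 + 0.1 * x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

mse  <- function(d) mean((y - predict(fit, newdata = d))^2)
base <- mse(data.frame(x1 = x1, x2 = x2))

# "Noise up" one variable at a time by permuting it, then measure the
# jump in prediction error relative to the baseline.
vimp1 <- mse(data.frame(x1 = sample(x1), x2 = x2)) - base  # large: x1 matters
vimp2 <- mse(data.frame(x1 = x1, x2 = sample(x2))) - base  # near zero
```

Permuting the informative `x1` inflates the error substantially, while permuting the nearly irrelevant `x2` barely moves it, which is exactly the ranking VIMP encodes.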
# Plot the VIMP rankings of the independent variables
plot(gg_vimp(fitallrf))
## Warning in gg_vimp.rfsrc(fitallrf): rfsrc object does not contain VIMP
## information. Calculating...
In our random forest, the variables ibt and temp have the largest VIMPs, and there is a sizeable gap between them and the remaining variables. So we should focus our attention on ibt and temp over the other variables.
Also, in this random forest, all VIMP values are positive, although some are small. When there are both positive and negative VIMP values, plot(gg_vimp()) will color VIMP by the sign of the measure.
With VIMP, prognostic factors are determined by testing the forest prediction under alternative data settings and ranking the most important variables according to their impact on the predictive ability of the forest. Alternatively, we can inspect the forest construction itself to rank variables. Minimal depth assumes that variables with high impact on prediction are those that most frequently split nodes close to the tree trunk (i.e., at or near the root node), partitioning the sample into large subsamples.
Within a tree, node levels are numbered based on relative distance to the trunk, which is 0. Minimal depth measures the important risk factors by averaging the depth of the first time a variable is used over all trees in the forest. Lower values mean that the variable is important in splitting large groups of observations.
The maximal subtree for variable x is the largest subtree whose root node splits on x. All parent nodes of x’s maximal subtree have nodes that split on variables other than x. If a variable does not split the root node, it can have more than one maximal subtree, or a maximal subtree may not exist if there are no splits on that variable.
Thus minimal depth is a surrogate measure of the predictive ability of the variable. The smaller the minimal depth, the more impact the variable has in sorting observations, and thus on forest prediction.
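To see why an informative variable tends to end up near the root, here is a self-contained sketch (simulated data and plain base R rather than rfsrc) that scores every candidate root split on a variable by the residual sum of squares it leaves behind, which is what a regression-tree grower minimizes:

```r
# The informative variable offers a far better root split (smaller
# post-split sum of squares) than the noise variable, so a tree would
# split on it first -- i.e., it gets a small minimal depth.
set.seed(7)
n <- 200
strong <- rnorm(n); weak <- rnorm(n)
y <- ifelse(strong > 0, 5, 0) + rnorm(n)   # y driven by 'strong' only

ss_after_best_split <- function(x, y) {
  cuts <- sort(unique(x))[-1]              # candidate cutpoints
  min(sapply(cuts, function(cc) {
    lo <- y[x < cc]; hi <- y[x >= cc]
    sum((lo - mean(lo))^2) + sum((hi - mean(hi))^2)
  }))
}

ss_after_best_split(strong, y) < ss_after_best_split(weak, y)  # TRUE
```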
# Select the variables
varsel_ozone <- var.select(fitallrf)
## minimal depth variable selection ...
##
##
## -----------------------------------------------------------
## family : regr
## var. selection : Minimal Depth
## conservativeness : medium
## x-weighting used? : TRUE
## dimension : 9
## sample size : 330
## ntree : 100
## nsplit : 10
## mtry : 3
## nodesize : 5
## refitted forest : FALSE
## model size : 9
## depth threshold : 5.1041
## PE (true OOB) : 16.5464
##
##
## Top variables:
## depth vimp
## temp 1.16 NA
## ibt 1.49 NA
## vh 2.05 NA
## ibh 2.09 NA
## humidity 2.51 NA
## vis 2.57 NA
## dpg 2.79 NA
## doy 2.92 NA
## wind 4.39 NA
## -----------------------------------------------------------
glimpse(varsel_ozone)
## List of 6
## $ err.rate : num 16.5
## $ modelsize : int 9
## $ topvars : chr [1:9] "temp" "ibt" "vh" "ibh" ...
## $ varselect :'data.frame': 9 obs. of 2 variables:
## ..$ depth: num [1:9] 1.16 1.49 2.05 2.09 2.51 2.57 2.79 2.92 4.39
## ..$ vimp : num [1:9] NA NA NA NA NA NA NA NA NA
## $ rfsrc.refit.obj: NULL
## $ md.obj :List of 11
## ..$ order : num [1:9, 1:2] 2.05 4.39 2.51 1.16 2.09 2.79 1.49 2.57 2.92 4.26 ...
## .. ..- attr(*, "dimnames")=List of 2
## ..$ count : Named num [1:9] 0.1236 0.1018 0.1451 0.0907 0.111 ...
## .. ..- attr(*, "names")= chr [1:9] "vh" "wind" "humidity" "temp" ...
## ..$ nodes.at.depth : num [1:10000, 1:100] 2 4 7 10 11 11 7 5 3 3 ...
## ..$ sub.order : NULL
## ..$ threshold : num 5.1
## ..$ threshold.1se : num 5.28
## ..$ topvars : chr [1:9] "vh" "wind" "humidity" "temp" ...
## ..$ topvars.1se : chr [1:9] "vh" "wind" "humidity" "temp" ...
## ..$ percentile : Named num [1:9] 0.1608 0.4096 0.194 0.0905 0.1685 ...
## .. ..- attr(*, "names")= chr [1:9] "vh" "wind" "humidity" "temp" ...
## ..$ density : Named num [1:18] 0.0448 0.0753 0.119 0.1363 0.123 ...
## .. ..- attr(*, "names")= chr [1:18] "0" "1" "2" "3" ...
## ..$ second.order.threshold: num 7.91
# Save the gg_minimal_depth object for later use
gg_md <- gg_minimal_depth(varsel_ozone)
# Plot the object
plot(gg_md)
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
In general with VIMP we examine the values and rather arbitrarily select some point along the ranking where there is a large difference in VIMP values. The minimal depth approach is a bit more quantitative. How do they compare?
We can use the gg_minimal_vimp function to compare rankings between minimal depth and VIMP.
# Plot minimal depth v VIMP
gg_mdVIMP <- gg_minimal_vimp(gg_md)
plot(gg_mdVIMP)
In this case we see that the measures are in agreement. If we were to see points above the red dashed line, then they would be ranked higher by VIMP than by minimal depth, indicating that variables are sensitive to misspecification. Points falling below the line have a higher minimal depth ranking, indicating that they are better at dividing large portions of the population. The further the points are from the line, the more discrepant the measures.
Random forests might seem a little like a black box, but it is possible to use this method to express a functional form for the predictor:
\[\begin{equation} \hat{f}_{rf}(x) = f(x) \end{equation}\]Usually this is a complex functional form. We can use graphical methods to examine the predicted response as a function of covariates using variable dependence plots. These show the predicted response as a function of a covariate of interest, with each observation represented as a point on the plot. Each predicted point is based on an individual observation's values of all the other covariates, and so is a function not only of the covariate of interest but of the full covariate profile.
#Create the variable dependence object from the random forest
gg_v <- gg_variable(fitallrf)
# Use the top ranked minimal depth variables only, plotted in minimal depth rank order
xvar <- gg_md$topvars
# Plot the variable list in a single panel plot
plot(gg_v, xvar = xvar, panel = TRUE, alpha = 0.4) +
labs(y="Predicted Ozone reading", x="")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
These look very similar to the EDA plots we made at the beginning, albeit with transposed axes. Panels are sorted to match the order of variables in the xvar argument and include a smooth loess line with a 95% shaded confidence band, indicating the trend of the prediction dependence over the covariate values.
Let’s try this with a decision tree rather than a regression tree, starting with the kyphosis dataset. First do the EDA:
mykyph <- mykyph %>% select(kyph, Age, Number, Start)
kyphm <- melt(mykyph, id.vars = "kyph")
ggplot(kyphm, aes(x=kyph, y = value, fill = kyph)) +
geom_boxplot(notch = F) +
facet_wrap(~variable, scales = "free_y", ncol=3)
set.seed(5491)
mykyph <- mykyph %>% mutate(fkyph = as.factor(kyph))
# Random forest for the kyphosis dataset
kyphfr <- rfsrc(fkyph ~., data = mykyph, ntree = 1000, tree.err = TRUE)
# Let's see what we've got
kyphfr
## Sample size: 81
## Frequency of class labels: 64, 17
## Number of trees: 1000
## Forest terminal node size: 1
## Average no. of terminal nodes: 3.991
## No. of variables tried at each split: 2
## Total no. of variables: 4
## Resampling used to grow trees: swr
## Resample size used to grow trees: 81
## Analysis: RF-C
## Family: class
## Splitting rule: gini *random*
## Number of random split points: 10
## Normalized brier score: 0.87
## AUC: 100
## Error rate: 0, 0, 0
##
## Confusion matrix:
##
## predicted
## observed FALSE TRUE class.error
## FALSE 64 0 0
## TRUE 0 17 0
##
## Overall error rate: 0%
Plot the OOB errors against the growth of the forest:
# Plotting OOB errors against the number of trees
gg_ek <- gg_error(kyphfr)
plot(gg_ek)
## Warning: Removed 2997 rows containing missing values (geom_path).
## geom_path: Each group consists of only one observation. Do you need to
## adjust the group aesthetic?
As we can see, the errors quickly drop essentially to zero; with only 81 observations and a terminal node size of 1, the forest separates the two classes perfectly (note the 0% overall error rate above).
This isn’t very much fun. Let’s try to see if we have more fun with the NHANES Diabetes dataset.
First some EDA:
peoplec <- people %>% select(Diabetes, Age, BMI)
peoplec <- melt(peoplec, id.vars = "Diabetes")
class(peoplec)
## [1] "data.frame"
ggplot(peoplec, aes(x=Diabetes, y = value, fill = Diabetes)) +
geom_boxplot(notch = F) +
facet_wrap(~variable, scales = "free_y", ncol=3)
summary(people)
## Age Gender Diabetes BMI
## Min. :12.00 female:3791 No :6871 Min. :13.30
## 1st Qu.:27.00 male :3764 Yes: 684 1st Qu.:23.40
## Median :42.00 Median :27.12
## Mean :42.94 Mean :28.21
## 3rd Qu.:57.00 3rd Qu.:31.80
## Max. :80.00 Max. :81.25
##
## HHIncome PhysActive
## more 99999 :1879 No :3251
## 75000-99999: 884 Yes:4304
## 25000-34999: 769
## 35000-44999: 706
## 45000-54999: 657
## 55000-64999: 528
## (Other) :2132
Let’s try creating the random forests now:
set.seed(1030)
peoplec <- people %>% select(Diabetes, Age, BMI, Gender, PhysActive) %>% mutate(fdiab = as.factor(Diabetes))
summary(peoplec)
## Diabetes Age BMI Gender PhysActive
## No :6871 Min. :12.00 Min. :13.30 female:3791 No :3251
## Yes: 684 1st Qu.:27.00 1st Qu.:23.40 male :3764 Yes:4304
## Median :42.00 Median :27.12
## Mean :42.94 Mean :28.21
## 3rd Qu.:57.00 3rd Qu.:31.80
## Max. :80.00 Max. :81.25
## fdiab
## No :6871
## Yes: 684
##
##
##
##
is.factor(peoplec$fdiab)
## [1] TRUE
class(peoplec)
## [1] "tbl_df" "tbl" "data.frame"
peoplec <- as.data.frame(peoplec)
diabrf <- rfsrc(fdiab ~ Age + Gender + BMI + PhysActive, data = peoplec, ntree = 1000, tree.err = TRUE)
diabrf
## Sample size: 7555
## Frequency of class labels: 6871, 684
## Number of trees: 1000
## Forest terminal node size: 1
## Average no. of terminal nodes: 637.447
## No. of variables tried at each split: 2
## Total no. of variables: 4
## Resampling used to grow trees: swr
## Resample size used to grow trees: 7555
## Analysis: RF-C
## Family: class
## Splitting rule: gini *random*
## Number of random split points: 10
## Normalized brier score: 19.3
## AUC: 90.92
## Error rate: 0.06, 0.02, 0.49
##
## Confusion matrix:
##
## predicted
## observed No Yes class.error
## No 6749 122 0.0178
## Yes 332 352 0.4854
##
## Overall error rate: 6.01%